Record and use the tail size to prefetch table tail #11406
Conversation
Force-pushed from ca7ff4e to 55d1966.
@hx235 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Force-pushed from 55d1966 to 0ff26df.
@hx235 has updated the pull request. You must reimport the pull request before landing.
      tail_prefetch_size);
} else {
  tail_prefetch_size = prefetch_all || preload_all
      ? static_cast<size_t>(4 * 1024 + 0.01 * file_size)
@anand1976 @ajkr I welcome any better heuristics for the case where we don't have tail_prefetch_size, e.g., for backward compatibility. I used the current one simply because someone provided it in the task about this.
I worry a little bit about smooth upgrade. When someone upgrades to this version, they'll initially get this crude heuristic rather than the old adaptive TailPrefetchStats. How hard would it be to leave TailPrefetchStats for a release or two as a fall-back when the new data is not available?
Good point - it's not difficult to leave TailPrefetchStats for a couple more releases as a fallback, and I will fix that.
Fixed by falling back first to TailPrefetchStats and last to the new heuristic `static_cast<size_t>(4 * 1024 + 0.01 * file_size)`.
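For readers following along, here is a minimal sketch of that fallback order; the function name and parameters (`PickTailPrefetchSize`, `recorded_tail_size`, `stats_suggested_size`) are hypothetical, not the exact variables in the patch:

```cpp
#include <cstddef>
#include <cstdint>

// Minimal sketch of the fallback order discussed above; not the exact patch.
size_t PickTailPrefetchSize(uint64_t recorded_tail_size,
                            size_t stats_suggested_size,
                            uint64_t file_size) {
  if (recorded_tail_size > 0) {
    // Exact tail size recorded in the manifest: preferred.
    return static_cast<size_t>(recorded_tail_size);
  }
  if (stats_suggested_size > 0) {
    // Old adaptive TailPrefetchStats heuristic, kept for files written
    // before the tail size was recorded.
    return stats_suggested_size;
  }
  // Last resort: the crude 4 KB + 1% of file size heuristic.
  return static_cast<size_t>(4 * 1024 + 0.01 * file_size);
}
```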
@ajkr pretty much ready for review except it's currently going through a stress test just in case. Will let you know when the stress test is done.
tail_prefetch_size = prefetch_all || preload_all ? 512 * 1024 : 4 * 1024;
if (tail_start_offset != 0) {
  tail_prefetch_size = file_size - tail_start_offset;
} else if (contain_no_data_block) {
[Note] This is for the case where the file only contains range deletion entries, so we have nothing in data blocks; here "0" is not a fallback but an informed number.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally LGTM, but some suggestions
HISTORY.md
Outdated
@@ -3,6 +3,9 @@
### New Features
* Introduced a new option `block_protection_bytes_per_key`, which can be used to enable per key-value integrity protection for in-memory blocks in block cache (#11287).

### Performance Improvements
* Record the starting offset of block-based table's tail (i.e., all blocks after data blocks till the end) in manifest and use it to prefetch the tail more accurately during table open instead of relying on heuristics (#11406). Heuristics will still be used for now as a fallback for backward compatibility.
"for now"
Perhaps say on files written by previous versions.
Fixed
Status s = TryReopen(options);
if (use_direct_io && (s.IsNotSupported() || s.IsInvalidArgument())) {
  // If direct IO is not supported, skip the test
  return;
A little better to use ROCKSDB_GTEST_BYPASS so that at least when running the test manually you can see if it unexpectedly bypassed the full test.
Fixed
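A sketch of how the bypass might look, assuming RocksDB's `ROCKSDB_GTEST_BYPASS` macro from `test_util/testharness.h` (the surrounding test code is abbreviated, not the exact test in this PR):

```cpp
// Sketch only; assumes ROCKSDB_GTEST_BYPASS from test_util/testharness.h.
Status s = TryReopen(options);
if (use_direct_io && (s.IsNotSupported() || s.IsInvalidArgument())) {
  // Unlike a silent return, this records that the test body was bypassed,
  // which is visible when running the test manually.
  ROCKSDB_GTEST_BYPASS("Direct IO is not supported");
  return;
}
ASSERT_OK(s);
```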
ASSERT_OK(Put("k1", "v1"));

HistogramData pre_flush_file_read;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find it a bit simpler to Reset() the statistics at the start of what you're interested in, but this is also OK
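For comparison, a sketch of the Reset()-based approach, assuming the public `Statistics::Reset()` and `Statistics::histogramData()` APIs; the `FILE_READ_FLUSH_MICROS` histogram here is an assumption for illustration:

```cpp
// Sketch of resetting statistics right before the operation of interest,
// instead of snapshotting a "pre" histogram and diffing afterwards.
ASSERT_OK(options.statistics->Reset());
ASSERT_OK(Flush());  // the operation being measured
HistogramData flush_file_read;
options.statistics->histogramData(FILE_READ_FLUSH_MICROS, &flush_file_read);
// flush_file_read.count now covers only reads issued during the flush.
```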
  tail_prefetch_size = file_size - tail_start_offset;
} else if (contain_no_data_block) {
  tail_prefetch_size = file_size - tail_start_offset;
  ROCKS_LOG_WARN(logger,
I don't think we need a warning here. A code comment would be fine IMHO
Fixed
tail_prefetch_size = prefetch_all || preload_all ? 512 * 1024 : 4 * 1024;
if (tail_start_offset != 0) {
  tail_prefetch_size = file_size - tail_start_offset;
} else if (contain_no_data_block) {
Might be an annoying change at this point, but recording the tail size rather than the offset would not have the same issue with 0 meaning either "unset" or "starting at offset 0." There's also a tiny manifest size advantage, as the tail size tends to be much smaller than the tail offset (which matters once it's encoded with a variable-length encoding). Getting rid of tracking contain_no_data_block would probably be nice.
Will look into this.
Fixed
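A small self-contained illustration of the size point above (the varint helper below is a toy stand-in for RocksDB's internal coding utilities, and the numbers are made up):

```cpp
#include <cstdint>
#include <cstdio>

// Toy helper (not RocksDB code): bytes needed to varint-encode a uint64_t.
static int VarintLength(uint64_t v) {
  int len = 1;
  while (v >= 128) {
    v >>= 7;
    ++len;
  }
  return len;
}

int main() {
  const uint64_t tail_start_offset = 64ull << 20;  // e.g. tail starts ~64 MB in
  const uint64_t tail_size = 512 * 1024;           // e.g. ~512 KB of tail
  // Offset: 4 varint bytes; size: 3 varint bytes in this example. A recorded
  // size of 0 is also unambiguous ("no value"), unlike an offset of 0, which
  // could mean a file whose tail starts at the beginning (no data blocks).
  std::printf("offset bytes: %d, size bytes: %d\n",
              VarintLength(tail_start_offset), VarintLength(tail_size));
  return 0;
}
```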
if (!s.ok()) {
  return s;
if (tail_prefetch_buffer == nullptr || !tail_prefetch_buffer->Enabled()) {
  rep->CreateFilePrefetchBuffer(
Is this code still covered by unit tests? (I sometimes throw in an assert(false); to test that.)
Yes - a couple of tests do, including my TEST_P(PrefetchTest, BlockBasedTableTailPrefetch).
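For anyone unfamiliar with the trick, a sketch of the temporary coverage probe mentioned above, mirroring the quoted snippet (remove before committing):

```cpp
// Temporary coverage probe: if no test fails at this assert, the branch is
// not exercised by the test suite. Sketch only, based on the quoted snippet.
if (tail_prefetch_buffer == nullptr || !tail_prefetch_buffer->Enabled()) {
  assert(false);  // TEMPORARY: delete after confirming test coverage
  rep->CreateFilePrefetchBuffer(
      /* ... unchanged arguments ... */);
}
```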
@@ -239,6 +240,10 @@ struct TableProperties {
  // 0 means not exists.
  uint64_t external_sst_file_global_seqno_offset = 0;

  // Offset where the "tail" part of SST file starts
  // "Tail" refers to all blocks after data blocks till the end of the SST file
  uint64_t tail_start_offset = 0;
The motivation for adding this appears to be for SST ingest/import, which is reasonable.
Yes, mostly for SST ingest/import/repair
HISTORY.md
Outdated
@@ -3,6 +3,9 @@
### New Features
* Introduced a new option `block_protection_bytes_per_key`, which can be used to enable per key-value integrity protection for in-memory blocks in block cache (#11287).

### Performance Improvements
* Record the starting offset of block-based table's tail (i.e., all blocks after data blocks till the end) in manifest and use it to prefetch the tail more accurately during table open instead of relying on heuristics (#11406). Heuristics will still be used for now as a fallback for backward compatibility.
In general, I prefer the release notes not to talk too much about low level details but instead the user experience. For example, "Improved the I/O efficiency of prefetching SST metadata by recording more information in the DB manifest. Opening files written with previous versions will still rely on heuristics for how much to prefetch."
Fixed
I failed to realize the complication of being unable to determine the tail size at the time of writing it out as a table property. We're discussing possible remedies offline.
Fixed by preserving tail_start_offset in table properties so ingest/import/repair can derive tail_size from it.
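A minimal sketch of that derivation (hypothetical helper name; the actual patch may differ):

```cpp
#include <cstdint>

// Hypothetical helper: external files (ingest/import) and repair have no
// manifest entry, so the tail size is derived from the tail_start_offset
// table property and the file size.
uint64_t DeriveTailSize(uint64_t file_size, uint64_t tail_start_offset) {
  // Treating 0 as "not recorded" purely for illustration; as discussed
  // above, a file with no data blocks legitimately has offset 0, which is
  // one reason recording the tail *size* was suggested instead.
  if (tail_start_offset == 0 || tail_start_offset > file_size) {
    return 0;  // unknown; callers fall back to heuristics
  }
  return file_size - tail_start_offset;
}
```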
Summary:

**Context:** [PR11406](#11406) caused more frequent reads during db open when reading files with no `tail_size` in the manifest, as part of the upgrade to 11406. This is because that PR introduced:

- a [smaller](https://github.com/facebook/rocksdb/pull/11406/files#diff-57ed8c49db2bdd4db7618646a177397674bbf25beacacecb104070071d30129fR833) prefetch tail buffer size compared to pre-11406 for small files (< 52 MB) when `tail_prefetch_stats` infers the tail size to be 0 (usually happens when the stats do not have much historical data to infer from early on)
- more reads (up to the number of partitioned filter/index blocks) when such a small prefetch tail buffer does not contain all the partitioned filter/index blocks needed in CacheDependencies(), since the [fallback logic](https://github.com/facebook/rocksdb/pull/11406/files#diff-d98f1a83de24412ad7f3527725dae7e28851c7222622c3cdb832d3cdf24bbf9fR165-R179) that prefetches all partitions at once will be [skipped](url) when such a small prefetch tail buffer is passed in

**Summary:**

- Revert the fallback prefetch buffer size change to fully preserve existing behavior during upgrading in `BlockBasedTable::PrefetchTail()`
- Use the passed-in prefetch tail buffer in `CacheDependencies()` only if it has a smaller offset than the offset of the first partitioned filter/index block, that is, at least as good as the existing prefetching behavior

Pull Request resolved: #11516

Test Plan:

- db bench

Create db with small files prior to PR 11406
```
./db_bench -db=/tmp/testdb/ --partition_index_and_filters=1 --statistics=1 -benchmarks=fillseq -key_size=3200 -value_size=5 -num=1000000 -write_buffer_size=6550000 -disable_auto_compactions=true -compression_type=zstd
```
Read db to see if post-PR has lower read QPS (i.e., rocksdb.file.read.db.open.micros count) during db open.
```
./db_bench -use_direct_reads=1 --file_opening_threads=1 --threads=1 --use_existing_db=1 --seed=1682546046158958 --partition_index_and_filters=1 --statistics=1 --db=/tmp/testdb/ --benchmarks=readrandom --key_size=3200 --value_size=5 --num=100 --disable_auto_compactions=true --compression_type=zstd
```
Pre-PR:
```
rocksdb.file.read.db.open.micros P50 : 3.399023 P95 : 5.924468 P99 : 12.408333 P100 : 29.000000 COUNT : 611 SUM : 2539
```
Post-PR:
```
rocksdb.file.read.db.open.micros P50 : 593.736842 P95 : 861.605263 P99 : 1212.868421 P100 : 2663.000000 COUNT : 585 SUM : 345349
```
_Note: To control the starting offset of the prefetch tail buffer more easily, I manually overrode the following to eliminate the effect of alignment_
```
class PosixRandomAccessFile : public FSRandomAccessFile {
  virtual size_t GetRequiredBufferAlignment() const override {
-   return logical_sector_size_;
+   return 1;
  }
```
- CI

Reviewed By: pdillinger

Differential Revision: D46472566

Pulled By: hx235

fbshipit-source-id: 2fe14ac8d489d15b0e08e6f8fe4f46d5f110978e
Summary:

**Context/Summary:** When a db is upgrading to adopt [pr11406](#11406), it's possible for RocksDB to infer a small tail size to prefetch for pre-upgrade files. Such a small tail size would have caused 1 file read per index or filter partition if partitioned index or filter is used. This PR provides a UT to show this would not happen.

Misc: refactor the related UTs a bit to make this new UT more readable.

Pull Request resolved: #11522

Test Plan:

- New UT

If the upgrade logic is wrong, e.g.,
```
--- a/table/block_based/partitioned_index_reader.cc
+++ b/table/block_based/partitioned_index_reader.cc
@@ -166,7 +166,8 @@ Status PartitionIndexReader::CacheDependencies(
   uint64_t prefetch_len = last_off - prefetch_off;
   std::unique_ptr<FilePrefetchBuffer> prefetch_buffer;
   if (tail_prefetch_buffer == nullptr || !tail_prefetch_buffer->Enabled() ||
-      tail_prefetch_buffer->GetPrefetchOffset() > prefetch_off) {
+      (false && tail_prefetch_buffer->GetPrefetchOffset() > prefetch_off)) {
```
then the UT will fail like below
```
[ RUN      ] PrefetchTailTest/PrefetchTailTest.UpgradeToTailSizeInManifest/0
file/prefetch_test.cc:461: Failure
Expected: (db_open_file_read.count) < (num_index_partition), actual: 38 vs 33
Received signal 11 (Segmentation fault)
```

Reviewed By: pdillinger

Differential Revision: D46546707

Pulled By: hx235

fbshipit-source-id: 9897b0a975e9055963edac5451fd1cd9d6c45d0e
### What changes were proposed in this pull request?
This PR aims to upgrade rocksdbjni from 8.1.1.1 to 8.3.2.

### Why are the changes needed?
The release notes: https://github.com/facebook/rocksdb/releases/tag/v8.3.2
- Bug Fixes: Reduced cases of illegally using Env::Default() during static destruction by never destroying the internal PosixEnv itself (except for builds checking for memory leaks). (facebook/rocksdb#11538)
- Performance Improvements: Fixed higher read QPS during DB::Open() reading files created prior to facebook/rocksdb#11406, especially when reading many small files (size < 52 MB) during DB::Open() and partitioned filter or index is used.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
- Pass GA.
- Manual test: `org.apache.spark.util.kvstore.RocksDBBenchmark`

**A.8.1.1.1**
```
                                      count  mean     min     max      95th
dbClose                               4      0.472    0.413   0.527    0.527
dbCreation                            4      263.388  42.032  922.916  922.916
naturalIndexCreateIterator            1024   0.005    0.001   2.646    0.004
naturalIndexDescendingCreateIterator  1024   0.002    0.002   0.055    0.002
naturalIndexDescendingIteration       1024   0.003    0.003   0.021    0.004
naturalIndexIteration                 1024   0.009    0.003   3.156    0.012
randomDeleteIndexed                   1024   0.017    0.013   0.381    0.023
randomDeletesNoIndex                  1024   0.010    0.009   0.032    0.011
randomUpdatesIndexed                  1024   0.066    0.025   19.900   0.074
randomUpdatesNoIndex                  1024   0.017    0.015   0.380    0.019
randomWritesIndexed                   1024   0.097    0.024   52.970   0.093
randomWritesNoIndex                   1024   0.019    0.015   1.101    0.021
refIndexCreateIterator                1024   0.002    0.002   0.044    0.002
refIndexDescendingCreateIterator      1024   0.001    0.001   0.013    0.001
refIndexDescendingIteration           1024   0.004    0.003   0.070    0.005
refIndexIteration                     1024   0.005    0.003   0.230    0.005
sequentialDeleteIndexed               1024   0.016    0.013   0.104    0.022
sequentialDeleteNoIndex               1024   0.011    0.009   0.044    0.011
sequentialUpdatesIndexed              1024   0.027    0.019   0.660    0.050
sequentialUpdatesNoIndex              1024   0.025    0.016   0.523    0.033
sequentialWritesIndexed               1024   0.030    0.020   1.526    0.040
sequentialWritesNoIndex               1024   0.030    0.017   4.410    0.035
```

**B.8.3.2**
```
                                      count  mean     min     max      95th
dbClose                               4      0.488    0.424   0.556    0.556
dbCreation                            4      241.375  35.710  850.488  850.488
naturalIndexCreateIterator            1024   0.004    0.001   1.555    0.006
naturalIndexDescendingCreateIterator  1024   0.002    0.002   0.064    0.002
naturalIndexDescendingIteration       1024   0.004    0.003   0.035    0.004
naturalIndexIteration                 1024   0.011    0.003   4.464    0.012
randomDeleteIndexed                   1024   0.018    0.013   0.505    0.024
randomDeletesNoIndex                  1024   0.010    0.009   0.025    0.011
randomUpdatesIndexed                  1024   0.065    0.024   20.210   0.077
randomUpdatesNoIndex                  1024   0.019    0.015   0.449    0.027
randomWritesIndexed                   1024   0.087    0.026   38.782   0.096
randomWritesNoIndex                   1024   0.019    0.016   1.040    0.019
refIndexCreateIterator                1024   0.002    0.002   0.051    0.002
refIndexDescendingCreateIterator      1024   0.001    0.001   0.013    0.001
refIndexDescendingIteration           1024   0.004    0.003   0.064    0.004
refIndexIteration                     1024   0.005    0.003   0.253    0.006
sequentialDeleteIndexed               1024   0.016    0.013   0.286    0.024
sequentialDeleteNoIndex               1024   0.010    0.009   0.071    0.012
sequentialUpdatesIndexed              1024   0.026    0.020   0.711    0.042
sequentialUpdatesNoIndex              1024   0.024    0.016   0.455    0.030
sequentialWritesIndexed               1024   0.025    0.020   1.359    0.033
sequentialWritesNoIndex               1024   0.028    0.017   4.354    0.030
```

B. Checked core module UTs with rocksdb live ui
```
export LIVE_UI_LOCAL_STORE_DIR=/${basedir}/spark-ui
build/mvn clean install -pl core -am -Dtest.exclude.tags=org.apache.spark.tags.ExtendedLevelDBTest -fn
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 39:09 min
[INFO] Finished at: 2023-06-30T09:48:44+08:00
[INFO] ------------------------------------------------------------------------
```

Closes #41802 from panbingkun/SPARK-44256.

Authored-by: panbingkun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
Context:
We prefetch the tail part of an SST file (i.e., the blocks after data blocks till the end of the file) during each SST file open, in the hope of prefetching everything needed for later reads (e.g., footer, meta index, filter/index) at once ahead of time. The existing approach estimates the tail size to prefetch through the TailPrefetchStats heuristics introduced in #4156, which has caused small reads in unlucky cases (e.g., a small read into the tail buffer during table open in thread 1 under the same BlockBasedTableFactory object can make thread 2's tail prefetching use a small size that it shouldn't) and is hard to debug. Therefore we decided to record the exact tail size and use it directly to prefetch the tail of the SST instead of relying on heuristics.

Summary:
Record the tail size as part of BlockBasedTableBuilder::Finish() and store it in the manifest; also make tail_start_offset part of the table properties and deduct the tail size from it to record in the manifest for external files (e.g., file ingestion, import CF) and db repair (with no access to the manifest).

Test:
Note: db bench on /tmp/, where direct read is supported, is too slow to finish, and the default pinning setting in db bench is not helpful for profiling the number of SST reads per Get. Therefore I hacked the following to obtain the comparison below.
Create DB
ReadRandom
(a) Existing (use TailPrefetchStats for tail size + use a separate prefetch buffer in PartitionedFilter/IndexReader::CacheDependencies())
(b) This PR (record tail size + use the same tail buffer in PartitionedFilter/IndexReader::CacheDependencies())
As we can see, we increase the prefetch tail hit count and decrease the SST read count with this PR.
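To summarize the mechanism in code, here is a rough, self-contained sketch of how a recorded tail size drives a single prefetch at table open. This is not the actual BlockBasedTable::PrefetchTail(); the names and the Prefetch callback are placeholders.

```cpp
#include <cstddef>
#include <cstdint>

// Rough sketch: prefetch [file_size - tail_prefetch_size, file_size) once so
// that footer, metaindex, index, and filter reads during table open hit the
// buffer. `Prefetch` is a stand-in for the real IO call.
void PrefetchTailSketch(uint64_t file_size, uint64_t recorded_tail_size,
                        size_t heuristic_size,
                        void (*Prefetch)(uint64_t offset, size_t length)) {
  size_t tail_prefetch_size =
      recorded_tail_size > 0 ? static_cast<size_t>(recorded_tail_size)
                             : heuristic_size;  // fallback for older files
  uint64_t prefetch_off =
      file_size > tail_prefetch_size ? file_size - tail_prefetch_size : 0;
  Prefetch(prefetch_off, static_cast<size_t>(file_size - prefetch_off));
}
```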